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Abstract 

We consider the problem of minimizing a sum of n functions via projected iterations onto 
a convex parameter set C C R p where n p 1. In this regime, algorithms which utilize 
sub-sampling techniques are known to be effective. In this paper, we use sub-sampling techniques 
together with eigenvalue thresholding to design a new randomized batch algorithm which possesses 
comparable convergence rate to Newton’s method, yet has much smaller per-iteration cost. The 
proposed algorithm is robust in terms of starting point and step size, and enjoys a composite 
convergence rate, namely, quadratic convergence at start and linear convergence when the iterate 
is close to the minimizer. We develop its theoretical analysis which also allows us to select near- 
optimal algorithm parameters. Our theoretical results can be used to obtain convergence rates 
of previously proposed sub-sampling based algorithms as well. We demonstrate how our results 
apply to well-known machine learning problems. Lastly, we evaluate the performance of our 
algorithm on several datasets under various scenarios. 


1 Introduction 

We consider the problem of minimizing an average of n functions f% : —> M, 

1 n 

minimize f{9) := - V fi{9 ), (1.1) 

e n z —' 

i=l 

in a batch setting, where n is assumed to be much larger than p. Most machine learning models 
can be expressed as above, where each function _/) corresponds to an observation. Examples include 
logistic regression, support vector machines, neural networks and graphical models. 

Many optimization algorithms have been developed to solve the above minimization problem 
using iterative methods |Bis95l IBV041 INes04| . In this paper, we consider the iterations of the 
following form 


o t+1 = 0*- mOtVofie*), ( 1 . 2 ) 

where r)t is the step size and Q # is a suitable scaling matrix that provides curvature information (For 
simplicity, we drop the projection throughout the introduction, i.e., we assume C = M p ). 
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Updates of the form Eq. (1.2) have been extensively studied in the optimization literature. The 
case where Q # is equal to the identity matrix corresponds to Gradient Descent (GD) which, un¬ 
der smoothness assumptions, achieves linear convergence rate with 0(np) per-iteration cost. More 
precisely, GD with ideal step size yields 


\\e t+1 -94 2 < £^-642, 

where, as lim^oo GD = 1 — an d K is * _ th largest eigenvalue of the Hessian of f(9) at 

minimizer (9*. 

Second order methods such as Newton’s Method (NM) and Natural Gradient Descent (NGD) 
[ Ama98j can be recovered by taking Q t to be the inverse Hessian and the Fisher information evaluated 
at the current iterate, respectively. Such methods may achieve quadratic convergence rates with 
0{np 2 + p 3 ) per-iteration cost | Bis95:. lNes'04 ]. In particular, for t large enough, Newton’s Method 
yields 

\\ 9 t+1 — 9 * || 2 < ^2,nmII^ — ^*111) 

and it is insensitive to the condition number of the Hessian. However, when the number of samples 
grows large, computation of Q* becomes extremely expensive. 

A popular line of research tries to construct the matrix Q* in a way that the update is com¬ 
putationally feasible, yet still provides sufficient second order information. Such attempts resulted 
in Quasi-Newton methods, in which only gradients and iterates are used in the construction of ma¬ 
trix Q*, resulting in an efficient update at each step t. A celebrated Quasi-Newton method is the 
Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm jBro70t lFle70l Gol70l ISha70 ] which requires 
0(np + p 2 ) per-iteration cost |Bis95l lNes04j . 

An alternative approach is to use sub-sampling techniques, where scaling matrix Q f is based on 
randomly selected set of data points [ MarlOl iBCNNlfl VI* 12 . Sub-sampling is widely used in the 
first order methods, but is not as well studied for approximating the scaling matrix. In particular, 
theoretical guarantees are still missing. 

A key challenge is that the sub-sampled Hessian is close to the actual Hessian along the directions 
corresponding to large eigenvalues (large curvature directions in /(#)), but is a poor approximation 
in the directions corresponding to small eigenvalues (flatter directions in /(#)). In order to overcome 
this problem, we use low-rank approximation. More precisely, we treat all the eigenvalues below the 
r-th as if they were equal to the (r + l)-th. This yields the desired stability with respect to the 
sub-sample: we call our algorithm NewSamp. In this paper, we establish the following: 


1. NewSamp has a composite convergence rate: quadratic at start and linear near the minimizer, 
as illustrated in Figure [l] Formally, we prove a bound of the form 

||0 t+1 — ||2 < — 0 * II2 + £211^ — 0*||! 

with coefficient that are explicitly given (and are computable from data). 

2. The asymptiotic behavior of the linear convergence coefficient is lim^oo = 1 — (A*/A* +1 ) + 5, 
for 5 small. The condition number (A*/A*) which controls the convergence of GD, has been 
replaced by the milder (A* +1 /A*). For datasets with strong spectral features, this can be a 
large improvement, as shown in Figure [l| 

3. The above results are achived without tuning the step-size, in particular, by setting rjt = 1. 
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4. The complexity per iteration of NewSamp is 0(np + | S'|_p 2 ) with |Sj the sample size. 

5. Our theoretical results can be used to obtain convergence rates of previously proposed sub¬ 
sampling algorithms. 

We demonstrate the performance of NewSamp on four datasets, and compare it to the well-known 
optimization methods. 

The rest of the paper is organized as follows: Section fTTI surveys the related work. In Section[2j we 
describe the proposed algorithm and provide the intuition behind it. Next, we present our theoretical 
results in Section [3j i.e., convergence rates corresponding to different sub-sampling schemes, followed 
by a discussion on how to choose the algorithm parameters. Two applications of the algorithm are 
discussed in Section [4j We compare our algorithm with several existing methods on various datasets 
in Section [5} Finally, in Section [6j we conclude with a brief discussion. 

1.1 Related Work 

Even a synthetic review of optimization algorithms for large-scale machine learning would go beyond 
the page limits of this paper. Here, we emphasize that the method of choice depends crucially on the 
amount of data to be used, and their dimensionality (i.e., respectively, on the parameters n and p ). In 
this paper, we focus on a regime in which p is large but not so large as to make matrix manipulations 
(of order p 2 to p 3 ) impossible. Also n is large but not so large as to make batch gradient computation 
(of order np ) prohibitive. On the other hand, our aim is to avoid 0(np 2 ) calculations required by 
standard Newton method. Examples of this regime are given in Section [4j 

In contrast, online algorithms are the option of choice for very large n since the computation per 
update is independent of n. In the case of Stochastic Gradient Descent (SGD), the descent direction 
is formed by a randomly selected gradient |RM5l ], Improvements to SGD have been developed 
by incorporating the previous gradient directions in the current update [SRJB13, SHRY13 . iBotlfll. 
1DHS11| . 

Batch algorithms, on the other hand, can achieve faster convergence and exploit second order 
information. They are competitive for intermediate n. Several methods in this category aim at 
quadratic, or at least super-linear convergence rates. In particular, Quasi-Newton methods have 
proven effective |Bis95l lNes04| . Another approach towards the same goal is to utilize sub-sampling 
to form an approximate Hessian [MarlOl IBGNNllj IVP121 |QRTF15l IEM151 lErdlbaj . If the sub¬ 
sampled Hessian is close to the true Hessian, these methods can approach NM in terms of convergence 
rate, nevertheless, they enjoy much smaller complexity per update. No convergence rate analysis 
is available for these methods: this analysis is the main contribution of our paper. To the best of 
our knowledge, the best result in this direction is proven in jBGNNlT] that establishes asymptotic 
convergence without quantitative bounds (exploiting general theory from |GNS09] b 

Further improvements have been suggested either by utilizing Conjugate Gradient (CG) methods 
and/or using Krylov sub-spaces [Marlflll i BGNNllUVP12] . Sub-sampling can be also used to obtain 
an approximate solution, if an exact solution is not required |DLFU13] , Lastly, there are various 
hybrid algorithms that combine two or more techniques to gain improvement. Examples include, 
sub-sampling and Quasi-Newton [SYG071ISDPG131IBHNS14] . SGD and GD }FS12] , NGD and NM 
|RFlf)j . NGD and low-rank approximation [RaMBOH] . 
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Algorithm 1 NewSamp 

Input: 6°,r,e,{rit,\St\}t,t = 0 . 

1. Define: Vc{0) = argmin e / gC ||0 — O' ||2 is the Euclidean projection onto C, 

[Ufc, A/-] = TVuncatedSVDfc (H) is the rank-A; truncated SVD of H with (A/.)^ = Aj. 

2. while ||0 t+1 — 0*H 2 < e do 

Sub-sample a set of indices St C [n] . 

Let H 5t = J2ieS t and [U r+ i, A r+ i] = TruncatedSVD r+1 (H, St ), 

Q 4 = A^I P + U r (A- 1 - A r -| 1 I r ) , 
d t+1 =Vc(e t -r H Q t Vef(.d t )), 

t i — t - hi. 

3. end while 
Output: O'. 

2 NewSamp : A Newton method via sub-sampling and eigenvalue 
thresholding 

In the regime we consider, n 3> p 1, there are two main drawbacks associated with the classi¬ 
cal second order methods such as Newton’s method. The predominant issue in this regime is the 
computation of the Hessian matrix, which requires 0(np 2 ) operations, and the other issue is find¬ 
ing the inverse of the Hessian, which requires 0(p 3 ) computation. Sub-sampling is an effective and 
efficient way of addressing the first issue, by forming an approximate Hessian to exploit curvature 
information. Recent empirical studies show that sub-sampling the Hessian provides significant im¬ 
provement in terms of computational cost, yet preserves the fast convergence rate of second order 
methods {MarlOl, 1VP121 IErdl5b| . If a uniform sub-sample is used, the sub-sampled Hessian will 
be a random matrix with expected value at the true Hessian, which can be considered as a sample 
estimator to the mean. Recent advances in statistics have shown that the performance of various 
estimators can be significantly improved by simple procedures such as shrinkage and/or thresholding 
|CCS10i rDG.TES. GDT41IHDT4] . To this extent, we use a specialized low-rank approximation as the 
important second order information is generally contained in the largest few eigenvalues/vectors of 
the Hessian. We will see in Section [3j how this procedure provides faster convergence rates compared 
to the bare sub-sampling methods. 

NewSamp is presented as Algorithm [lj At iteration step t, the sub-sampled set of indices, 
its size and the corresponding sub-sampled Hessian is denoted by St, |St| and Hg t , respectively. 
Assuming that the functions ft s are convex, eigenvalues of the symmetric matrix H, 5 t are non¬ 
negative. Therefore, singular value (SVD) and eigenvalue decompositions coincide. The operation 
TruncatedSVDfc (Hg t ) = [Ufc,Afc] is the best rank-A; approximation, i.e., takes Hs t as input and 
returns the largest k eigenvalues in the diagonal matrix Aj. e M fcxfc with the corresponding k eigen¬ 
vectors Ufc 6 MP xk . This procedure requires 0(kp 2 ) computation using a standard method, though 
there are faster randomized algorithms which provide accurate approximations to the truncated SVD 
problem with much less computational cost |HMT11| . To construct the curvature matrix [Q f ] - 1 , 
instead of using the basic rank-?’ approximation, we fill its 0 eigenvalues with the (r + l)-th eigenvalue 
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Figure 1: Left plot demonstrates convergence rate of NewSamp , which starts with a quadratic rate 
and transitions into linear convergence near the true minimizer. The right plot shows the effect of 
eigenvalue thresholding on the convergence coefficients, x-axis shows the number of kept eigenvalues. 
Plots are obtained using Covertype dataset. 


of the sub-sampled Hessian which is the largest eigenvalue below the threshold. If we compute a 
truncated SVD with k = r + 1 and (Ak)u = A*, the described operation can be formulated as the 
following, 


Q* = A r -|,I p + U r (A” 1 - U 


T 
r ? 


( 2 . 1 ) 


which is simply the sum of a scaled identity matrix and a rank-r matrix. Note that the low-rank 
approximation that is suggested to improve the curvature estimation has been further utilized to 
reduce the cost of computing the inverse matrix. Final per-iteration cost of NewSamp will be 
O ( np + (\St\ + r)p 2 ) « O (np + |Si|p 2 ). NewSamp takes the parameters { 77 *, \St\}t and r as inputs. 
We discuss in Section |3.4[ how to choose these parameters near-optimally, based on the theory we 
develop in Section [3j 

Operator Vc projects the current iterate to the feasible set C using Euclidean projection. Through¬ 
out, we assume that this projection can be done efficiently. In general, most unconstrained optimiza¬ 
tion problems do not require this step, and can be omitted. The purpose of projected iterations in 
our algorithm is mostly theoretical, and will be clear in Section [3| 

By the construction of Q f , NewSamp will always be a descent algorithm. It enjoys a quadratic 
convergence rate at start which transitions into a linear rate in the neighborhood of the minimizer. 
This behavior can be observed in Figure [l] The left plot in Figure 1 shows the convergence behavior 
of NewSamp over different sub-sample sizes. We observe that large sub-samples result in better 
convergence rates as expected. As the sub-sample size increases, slope of the linear phase decreases, 
getting closer to that of quadratic phase at the transition point. This phenomenon will be explained 
in detail in Section [3j by Theorems 3.2 and |3.4| The right plot in Figure [l] demonstrates how the 
coefficients of linear and quadratic phases depend on the thresholded rank. Note that the coefficient 
of the quadratic phase increases with the rank threshold, whereas for the linear phase, relation is 
reversed. 
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3 Theoretical results 


In this section, we provide the convergence analysis of NewSamp based on two different sub-sampling 
schemes: 

SI: Independent sub-sampling: At each iteration t, St is uniformly sampled from [n] = {1, 2,n}, 
independently from the sets {<SV} T <t, with or without replacement. 

S2: Sequentially dependent sub-sampling: At each iteration t, St is sampled from [n]. based 
on a distribution which might depend on the previous sets {<S' r } T <i, but not on any randomness 
in the data. 

The first sub-sampling scheme is simple and commonly used in optimization. One drawback is 
that the sub-sampled set at the current iteration is independent of the previous sub-samples, hence 
does not consider which of the samples were previously used to form the approximate curvature 
information. In order to prevent cycles and obtain better performance near the optimum, one might 
want to increase the sample size as the iteration advances [Mar M, including previously unused 
samples. This process results in a sequence of dependent sub-samples which falls into the sub¬ 
sampling scheme S2. In our theoretical analysis, we make the following assumptions: 

Assumption 1 (Lipschitz continuity). For any subset S C [n], there exists a constant Afg\ depending 
on the size of S, such that \/6,9' G C, 

||Hs(0)-Hs(0')|| 2 <M| 5| ||0-0'|| 2 . 

Assumption 2 (Bounded Hessian). Vi = 1,2, ...,n, the Hessian of the function fi(0), W 2 g fi(0), is 
upper bounded by an absolute constant I\, i.e., 

max || Vg/j(0)|L < K. 

i nrt 1 1 1 1 " 


3.1 Independent sub-sampling 

In this section, we assume that St C [ n ] is sampled according to the sub-sampling scheme SI. In fact, 
many stochastic algorithms assume that St is a uniform subset of [n], because in this case the sub¬ 
sampled Hessian is an unbiased estimator of the full Hessian. That is, V# G C, E [13^(0)] = H[ n ](0), 
where the expectation is over the randomness in St . We next show that for any scaling matrix Q* that 


is formed by the sub-samples St, iterations of the form Eq. (1.2) will have a composite convergence 
rate, i.e., combination of a linear and a quadratic phases. 

Lemma 3.1. Assume that the parameter set C is convex and St C [n] is based on sub-sampling 
scheme SI. Further, let the Assumptions^ and^hold and G C. Then, for an absolute constant 


c > 0, with probability at least 1 — 2/p, the updates of the form Eq. (1.2) satisfy 


112 ^ 11^* — 112 + £211^* — @* 


|2 
12 5 


for coefficients £* and defined as 


£1 = 


I-rHQfHstf* 


+ rjtcK II 


/log(p) 


I s t 


tt _ n^ii 

S2 — Vt 11 H \ 
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Remark 1. If the initial point 9 0 is close to 9 *, the algorithm will start with a quadratic rate of 
convergence which will transform into linear rate later in the close neighborhood of the optimum. 


The above lemma holds for any matrix Q*. In particular, if we choose Q* = H^ 1 , we obtain a 
bound for the simple sub-sampled Hessian method. In this case, the coefficients and depend 
on ||Q 4 1 |2 = 1 /Xp where A* is the smallest eigenvalue of the sub-sampled Hessian. Note that A* can 
be arbitrarily small which might blow up both of the coefficients. In the following, we will see how 
NewSamp remedies this issue. 


Theorem 3.2. Let the assumptions in Lemma 3.1 hold. Denote by the i-th eigenvalue of'Hs t (9 


where 9 1 is given by NewSamp at iteration step t. If the step size satisfies 


Vt < 


2 

1 + A*/A(. +1 ’ 


(3.1) 


then we have, with probability at least 1 — 2/p, 


\\e t+1 - 9*\\ 2 < tfiwP 


9* II2 + £,2 


|| 2 
Il2i 


for an absolute constant c > 0, for the coefficients £( and are defined as 


ei = i - rh 


Ap 

At 


+ Vt 


r +1 



M-n 

z\ + i 


NewSamp has a composite convergence rate where and are the coefficients of the linear 
and the quadratic terms, respectively (See the right plot in Figure [I]). We observe that the sub¬ 
sampling size has a significant effect on the linear term, whereas the quadratic term is governed by 
the Lipschitz constant. We emphasize that the case r)t = 1 is feasible for the conditions of Theorem 


algorithm converges linearly. Following corollary summarizes this case. 


3.2 In the case of quadratic functions, since the Lipschitz constant is 0 , we obtain = 0 and the 


Corollary 3.3 (Quadratic functions). Let the assumptions of Theorem 3.2 hold. Further, assume 
that \/i £ [n], the functions 9 : M p —> fi{9) are quadratic. Then, for 9 l given by NewSamp at iteration 
step t, for the coefficient defined as in Theorem 3.2, with probability at least 1 — 2/p, we have 


\\ e t+1 - fl *|| 2 <£ 11 ^- 0 * 


(3.2) 


3.2 Sequentially dependent sub-sampling 

Here, we assume that the sub-sampling scheme S2 is used to generate {SV}r>i- Distribution of 
sub-sampled sets may depend on each other, but not on any randomness in the dataset. Examples 
include fixed sub-samples as well as sub-samples of increasing size, sequentially covering unused data. 
In addition to Assumptions [l]{2j we assume the following. 

Assumption 3 (i.i.d. observations). Let Z\, z-i, ■■■, z n £ Z be i.i.d. observations from a distribution 
D. For a fixed 9 £ and Mi £ [n], we assume that the functions {/i}f=i satisfy fi(9) = ip(zi, 9), for 
some function ^ : Z x ff -> R. 


7 












Most statistical learning algorithms can be formulated as above, e.g., in classification problems, 
one has access to i.i.d. samples {(y*, Xj)}” =1 where y l and X{ denote the class label and the covariate, 
and ip measures the classification error (See Section [4] for examples). For the sub-sampling scheme 
S2, an analogue of Lemma 3.1 is stated in Appendix as Lenmia |A.l| which immediately leads to the 
following theorem. 

Theorem 3.4. Assume that the parameter set C is convex and St C [n] is based on the sub-sampling 
scheme S2. Further, let the Assumptions [1 1 a nd[j| hold, almost surely. Conditioned on the event 


£ = { 9 * G C}, if the step size satisfies Eq. 3.1. then for 6 1 given by NewSamp at iteration t, with 


probability at least 1 — cp e p for eg = c/P(£), we have 


\\e t+1 -0*|| 2 < £il|0 t -0*l|2 + £||0 t -0*||l 


for the coefficients £( and tf 2 defined as 


tt , A* d K I p /diam(C) 2 [M n + M ,5 ,) 2 \S t 

?i=1_ ^ + %m log l-**- 


Vt 


M n 

2\t 


r +1 


where c, d > 0 are absolute constants and \\ denotes the i-th eigenvalue o/Hs,(r). 

Compared to the Theorem |3.2| we observe that the coefficient of the quadratic term does not 
change. This is due to Assumption [lj However, the bound on the linear term is worse, since we use 
the uniform bound over the convex parameter set C. The same order of magnitude is also observed 
by |Erdl5bj . which relies on a similar proof technique. Similar to Corollary |3.3[ we have the following 
result for the quadratic functions. 


Corollary 3.5 (Quadratic functions). Let the assumptions of Theorem 3.f hold. Further assume that 
Vi G [n], the functions 6 —> fi{0) are quadratic. Then, conditioned on the event £, with probability 
at least 1 — cs e~ p , NewSamp iterates satisfy 


\\ 0 t+1 -942 < £ 111 ^- 0 * 112 , 


for coefficient defined as in Theorem 3.f 


3.3 Dependence of coefficients on t and convergence guarantees 

The coefficients and ff 2 depend on the iteration step which is an undesirable aspect of the above 
results. However, these constants can be well approximated by their analogues and evaluated 
at the optimum which are defined by simply replacing A* with A!- in their definition, where the latter 
is the j-th eigenvalue of full-Hessian at 0 *. For the sake of simplicity, we only consider the case where 
the functions 9 —> ffO ) are quadratic. 

Theorem 3.6. Assume that the functions fi{0) are quadratic, St is based on scheme SI and ry = 1 ■ 
Let the full Hessian at 9* be lower bounded by a constant k. Then for sufficiently large |St|, we have, 
with probability 1 — 2/p 

| pt _y*l < c f E V§gMZM - = a 

11 11 “ k{k-c 2 K^\og{p)/\S t \) ' ’ 


for some absolute constants c\,C 2 - 























Theorem 3.6 implies that, when the sub-sampling size is sufficiently large, £* will concentrate 
around Generalizing the above theorem to non-quadratic functions is straightforward, in which 
case, one would get additional terms involving the difference || 9 l — 0*11 2 - In the case of scheme S2, 
if one uses fixed sub-samples, i.e., Vf, St = S, then the coefficient does not depend on t. The 
following corollary gives a sufficient condition for convergence. A detailed discussion on the number 
of iterations until convergence and further local convergence properties can be found in Appendix [B} 

Corollary 3.7. Assume that and are well-a pproximated by £* and Q with an error bound of 
5, i.e., < f* + <5 for i = 1,2, as in Theorem 3.6 For the initial point 9°, a sufficient condition for 

convergence is 


\\ 0 ° 


< 


l-S-6 

e 2 + s 


3.4 Choosing the algorithm parameters 

Algorithm parameters play a crucial role in most optimization methods. Based on the theoretical 
results from previous sections, we discuss procedures to choose the optimal values for the step size 
T ) t , sub-sample size \St\ and rank threshold. 


Step size: For the step size of NewSamp at iteration t, we suggest 


Vth) = 


1 + A*/A * +1 + 7 


(3.3) 


where 7 = 0(log(p)/|S)|). Note that r]t(0) is the upper bound in Theorems 3.2 and 3.4 and it 
minimizes the first component of . The other terms in and £2 linearly depend on ry. To 
compensate for that, we shrink rjt( 0) towards 1. Contrary to most algorithms, optimal step 
size of NewSamp is larger than 1. See Appendix [C] for a rigorous derivation of Eq. 3.3 


Sample size: By Theorem 3.2 a sub-sample of size 0((K/X*) 2 log(p)) should be sufficient 


to obtain a small coefficient for the linear phase. Also note that sub-sample size \St\ scales 
quadratically with the condition number. 


• Rank threshold: For a full-Hessian with effective rank R (trace divided by the largest eigen¬ 
value), it suffices to use 0(R\og(p)) samples |Verl 0 l IVerl 2 j . Effective rank is upper bounded 
by the dimension p. Hence, one can use p\og(p) samples to approximate the full-Hessian and 
choose a rank threshold which retains the important curvature information. 


4 Examples 

4.1 Generalized Linear Models 

Finding the maximum likelihood estimator in Generalized Linear Models (GLMs) is equivalent to 
minimizing the negative log-likelihood f(9 ), 

1 n 

minimize f{0) = - ^ [$((zi, 0)) - yi(xi, 9)] , (4.1) 

2=1 
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where $ is the cumulant generating function, E M denotes the observations, Xi E M p denotes the 
rows of design matrix X E M nxp , and 9 E M p is the coefficient vector. Note that this formulation only 
considers GLMs with canonical links. Here, (x, 6) denotes the inner product between the vectors x, 
9. The function defines the type of GLM. Well known examples include ordinary least squares 
(OLS) with 3>(z) = z 2 , logistic regression (LR) with = log(l + e z ), and Poisson regression (PR) 
with 4>(,z) = e z . 

The gradient and the Hessian of the above function can be written as: 

n i n 

Vfl/(0) = -y j \^ 1 \{x i ,e))x i -y i x i 1 , Vim = -Y j & 2 \(x i ,e))x i xJ. (4.2) 

n z J L J n zJ 

i =1 i— 1 


We note that the Hessian of the GLM problem is always positive definite. This is because the second 
derivative of the cumulant generating function is simply the variance of the observations. Using the 
results from Section [3j we perform a convergence analysis of our algorithm on a GLM problem. 

Corollary 4.1. Let St C [n] be a uniform sub-sample, and C be a convex parameter set. Assume that 
the second derivative of the cumulant generating function, is bounded by 1, and it is Lipschitz 
continuous with Lipschitz constant L. Further, assume that the covariates are contained in a ball 
of radius \[Rx, Le. max ie j n ] 11 xq-11 2 < \fRf- Then, for 9 t given by NewSamp with constant step size 
rjt = 1 at iteration t, with probability at least 1 — 2/p, we have 

ll ^ +1 — II2 < £i|| 0 * — 9 *||2 + £2!!^ — ^*|||> 

for constants and defined as 

K , /log(p) , t _LRl /2 

41 K + i K + i^ ISti ’ 42 2 a* +1 ’ 

where c> 0 is an absolute constant and \\ is the ith eigenvalue of lls t (9 t ). 

Proof of Corollary |4 .11 can be found in Appendix [A) Note that the bound on the second derivative 
is quite loose for Poisson regression due to exponentially fast growing cumulant generating function. 


4.2 Support Vector Machines 


A linear Support Vector Machine (SVM) provides a separating hyperplane which maximizes the mar¬ 
gin, i.e., the distance between the hyperplane and the support vectors. Although the vast majority 
of the literature focuses on the dual problem |Vap98 , FSS02l . SVMs can be trained using the primal 
as well. Since the dual problem does not scale well with the number of data points (some approaches 
get 0(n 3 ) complexity, [WGllj i. the primal might be better-suited for optimization of linear SVMs 
jlvDOol IChaOT] . 

The primal problem for the linear SVM can be written as 


1 1 

minimize f(9) = -||0||! + -C'y\i(y i ,(0,x i )) 

6»eC 2 2 

2=1 


(4.3) 


where ( y%,Xi ) denote the data samples, 9 defines the separating hyperplane, C > 0 and £ could be 
any loss function. The most commonly used loss functions include Hinge-p loss, Huber loss and their 
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smoothed versions EE a 07]. Smoothing or approximating such losses with more stable functions is 
sometimes crucial in optimization. In the case of NewSamp which requires the loss function to be 
twice differentiable (almost everywhere), we suggest either smoothed Huber loss, i.e., 


£{y, (6,x)) 


' 0 , 

< (3/2 -y(9,x)) 2 
2 

l-y{0,x), 


if y(9,x) > 3/2, 
if |1 - y(9,x) | < 1/2, 
otherwise. 


or Hinge-2 loss, i.e., 

t(y, (0,x)) = max {0,1 - y(6, x)} 2 . 

For the sake of simplicity, we will focus on Hinge-2 loss. Denote by SVt, the set of indices of all the 
support vectors at iteration t, i.e., 


SV t = {i : yi(6 t ,Xi) < 1}. 

When the loss is set to be the Hinge-2 loss, the Hessian of the SVM problem, normalized by the 
number of support vectors, can be written as 


v 2 e m = ^—{i+c y XiX j). 

1 tl ieSVt 

When \SVt\ is large, the problem falls into our setup and can be solved efficiently using NewSamp. 
Note that unlike the GLM setting, Lipschitz condition of our Theorems do not apply here. However, 
we empirically demonstrate that NewSamp works regardless of such assumptions. 


5 Experiments 


In this section, we validate the performance of NewSamp through extensive numerical studies. We 
experimented on two optimization problems, namely, Logistic Regression (LR) and Support Vector 
Machines (SVM) with quadratic loss. LR minimizes Eq. 4.1 for the logistic function, whereas SVM 
minimizes Eq. |4.3| for the Hinge-2 loss. 

In the following, we briefly describe the algorithms that are used in the experiments: 


1. Gradient Descent (GD), at each iteration, takes a step proportional to negative of the full 
gradient evaluated at the current iterate. Under certain regularity conditions, GD exhibits a 
linear convergence rate. 

2. Accelerated Gradient Descent (AGD) is proposed by Nesterov |Nes83] . which improves over 
the gradient descent by using a momentum term. Performance of AGD strongly depends of 
the smoothness of the function / and decreasing step size adjustments may be necessary for 
convergence. 

3. Newton’s Method (NM) achieves a quadratic convergence rate by utilizing the inverse Hessian 
evaluated at the current iterate. However, the computation of Hessian makes it impractical for 
large-scale datasets. 
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Figure 2: Performance of various optimization methods on different datasets. NewSamp is repre¬ 
sented with red color . 

4. Broyden-Fletcher-Goldfarb-Shanno (BFGS) is the most popular and stable Quasi-Newton method. 
Scaling matrix is formed by accumulating the information from iterates and gradients, satis¬ 
fying Quasi-Newton rule. The convergence rate is locally super-linear and per-iteration cost is 
comparable to first order methods. 

5. Limited Memory BFGS (L-BFGS) is a variant of BFGS, which uses only the recent iterates 
and gradients to form the approximate Hessian, providing significant improvement in terms of 
memory usage. 

6 . Stochastic Gradient Descent (SGD) is a simplified version of GD where, at each iteration, 
instead of the full gradient, a randomly selected gradient is used. Per-iteration cost is indepen¬ 
dent of n, yet the convergence rate is significantly slower compared to batch algorithms. We 
follow the guidelines of [ BotlO L fSHRY13j for the step size,, i.e., 

_ 7 

Ti | | , / > 

1 + t/c 

for constants 7 , c > 0 . 

7. Adaptive Gradient Scaling (AdaGrad) is an online algorithm which uses an adaptive learning 
rate based on the previous gradients. AdaGrad significantly improves the performance and 
stability of SGD I )l IS I I . This is achieved by scaling each entry of gradient differently. , i.e., 
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at iteration step t, step size for the j-th coordinate is 


(7t), = 7 ' = , 

for constants 5 ,7 > 0 . 

For each of the batch algorithms, we used constant step size, and for all the algorithms, we choose the 
step size that provides the fastest convergence. For the stochastic algorithms, we optimized over the 
parameters that define the step size. Parameters of NewSamp are selected following the guidelines 
described in Section l3~4l 

We experimented over various datasets that are given in Table [l] The real datasets are down¬ 
loaded from the UCI repository [Lie m- Each dataset consists of a design matrix X 6 M nxp and 
the corresponding observations (classes) y £ M n . Synthetic data is generated through a multivariate 
Gaussian distribution with a randomly generated covariance matrix. As a methodological choice, we 
selected moderate values of p, for which Newton’s Method can still be implemented, and nevertheless 
we can demonstrate an improvement. For larger values of p, comparison is even more favorable to 
our approach. 

The effects of sub-sampling size \St\ and rank threshold are demonstrated in Figure[lJ A thorough 
comparison of the aforementioned optimization techniques is presented in Figure [2j In the case of LR, 
we observe that stochastic algorithms enjoy fast convergence at start, but slows down later as they get 
close to the true minimize!'. The algorithm that comes close to NewSamp in terms of performance 
is BFGS. In the case of SVM, Newton’s method is the closest algorithm to NewSamp, yet in all 
scenarios, NewSamp outperforms its competitors. Note that the global convergence of BFGS is not 
better than that of GD [Nes04] , The condition for super-linear rate is \\6 t — 0*||2 < 00 for which, 
an initial point close to the optimum is required |DM77j . This condition can be rarely satisfied in 
practice, which also affects the performance of the other second order methods. For NewSamp , even 
though the rank thresholding provides a certain level of robustness, we observed that the choice of 
a good starting point is still an important factor. Details about Figure [2] can be found in Table [3] in 
Appendix. For additional experiments and a detailed discussion, see Appendix |D| 


Dataset 

n 

P 

r 

Reference 

CT slices 

53500 

386 

60 

GKS+11 

Covertype 

581012 

54 

20 

[BD99] 

MSD 

515345 

90 

60 

IBMEWL11| 

Synthetic 

500000 

300 

3 

- 


Table 1: Datasets used in the experiments. 


6 Conclusion 

In this paper, we proposed a sub-sampling based second order method utilizing low-rank Hessian 
estimation. The proposed method has the target regime n p and has O (np + |*S'|p 2 ) complexity 
per-iteration. We showed that the convergence rate of NewSamp is composite for two widely used 
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sub-sampling schemes, i.e., starts as quadratic convergence and transforms to linear convergence 
near the optimum. Convergence behavior under other sub-sampling schemes is an interesting line of 
research. Numerical experiments on both real and synthetic datasets demonstrate the performance 
of the proposed algorithm which we compared to the classical optimization methods. 
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A Proofs of Theorems and Lemmas 


Proof of Lemma 3.1. We write, 


- 9 ,- ruQfVefiP) = &-&*- ritQ' /' + r(0* - 0*))(0* - <?*) dr, 

Jo 

= (/ - JftQ* J ^lf{0* + r(<? 4 - 0,))dr) (0* - 0*) • 

Since the projection Pc hi step 2 of NewSamp can only decrease the £2 distance, we obtain 

/•1 


\\ e t+1 - 0*|| 2 < 


I-r]tQ V|/(0* + r(^-0*))dr 


/o 


|| 0*-0 


* 2 • 


Note that the first term on the right hand side governs the convergence behavior of the algorithm. 
Next, for an index set S C [n], define the matrix Hs(0) as 

1 


h ^) = m £ h ‘W 




where |Sj denotes the size of the set. Denote the integral in the above equation by H, that is, 

H= f 1 V^/( 0 * + r( 0 * - 6 m ))dr. 


By the triangle inequality, the governing term that determines the convergence rate can be 
bounded as 


I - 7ft Q*H 


< 


I - THQfHsie* 


(A.l) 


+ 77 * ||q *|| 2 { ||h s (^) - h n (^)|| 9 + |h w ( 0 *) - h || 2 }, 


which holds, regardless of the choice of QL 

In the following, we will use some matrix concentration results to bound the right hand side 
of Eq. (A.l). The result for sampling with replacement can be obtained by matrix Hoeffding’s 
inequality given in jTYol2| . Note that this explicitly assumes that the samples are independent. For 
the concentration bounds under sampling without replacement (see i.e. f GNlO. 1 . Grolll [MJC + 14] ). 
we will use the Operator-Bernstein inequality given in |GN10| which is provided in Section [E] as 
Lemma lE.3l for convenience. 

Using any indexing over the elements of sub-sample S, we denote the each element in S by s,;, 

i.e., 

S = {si, s 2 , ■■■, S|s|}- 

For 0 £ C, we define the centered Hessians, W,;(0) as 

Wi(0) = H s .(0)-E[H 8 .(0)], 
where the E [H Si (0)] is just the full Hessian at 6. 
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By the Assumption Q, we have 

max 11 Hj ( 0 )|| 2 = ||V|/i(0)|L < K, (A. 2 ) 

i<n " 

max || Wj|| 2 < 2K := 7 , max ||wf || < 4 K 2 := o -2 . 

z<n i<n ^ 


Next, we apply the matrix Bernstein’s 


inequality given in Lemma E.3 


For e < 4A', and 9 £ C, 


P (||H S («) - H W (9)|| 2 > e ) < 2pexp {-^1} . 
Therefore, to obtain a convergence rate of 0(l/p), we let 


(A.3) 


e = 



logQ) 

\S\ 


where C = 6 K is sufficient. We also note that the condition on e is trivially satisfied by our choice 
of e in the target regime. 

For the last term, we may write, 




H [n] (0*)- / Vg/($* + — 0*))dr 


< 


f H [n] 0* 

Jo 


)-v 2 g f(e Jf +T(e t -e*)) 


dr, 


»i 


</ M n (l - t)\\9- 0*|| 2 dr, 

^ 112 - 

First inequality follows from the fact that norm of an integral is less than or equal to the integral 
of the norm. Second inequality follows from the Lipschitz property. 

Combining the above results, we obtain the following for the governing term in Eq.(A.l): For 
some absolute constants c, C > 0, with probability at least 1 — 2/p, we have 


I - 77 t Q*H w (^)| 2 < ||J - rnO/HsiP)^ + m ||Q*|| 2 




}■ 


Hence, the proof is completed. 


□ 


Proof of Theorem 3.2. Using the definition of Q 4 in NewSamp, we immediately obtain that 


I-thQ^s^ 


= max 

2 i>r 


A r+1 


(A.4) 


and that ||Qq| 2 = 1 /. Then the proof follows from Lemma 3.1 and by the assumption on the 
step size. □ 
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Lemma A.l. Assume that the parameter set C is convex and St C [n] is based on sub-sampling 
scheme S2. Further, let the Assumptions [I], [2] and [3] hold, almost surely. Then, for some absolute 
constants c,C>0, with probability at least 1 — e~ p , the updates of the form stated in Eq. (1.2) satisfy 


||# f+1 — 3* || 2 < £ i ||0 4 — 0 * || 2 + £ 2 l |3 f — 


|2 

12 ) 


for coefficients ^,£2 defined as 


ei= i-vtQ^sA^) 


+ r]t ||Q*|L x cK 


N 


p log ( diam ( C ) 2 ( M n + M |5t |) 2 \ S t\ 

I <St| 




K 2 


e$=^l|Q‘| 


2 ■ 


Proof of Lemma \A.1\ The first part of the proof is the same as Lemma 3.1 We carry our analysis 
from Eq.(A.l). Note that in this general set-up, the iterates are random variables that depend on 
the random functions. Therefore, we use a uniform bound for the right hand side in Eq.(A.l). That 


is, 


I-mQ *H 2 < I - ritOfHsiO*) 

+m ||Ql| 2 { sup ||H S (3) - H [n] (6»)|| 2 + ^||3 4 - 3 , 112 }. 

By the Assumption [TJ given 9,0' E C such that \\6 — 0'\\2 < A, we have, 

||H S (3) - H w (3)|| 2 < ||H S (3') - H w ( 3')|| 2 + (M n + 1 %,) || 9 - 3'|| 2 
< ||H s(O') - H w ( 3')|| 2 + (■Mn + M,5|) A. 


Next, we will use a covering net argument to obtain a bound on the matrix empirical process. 
Note that similar bounds on the matrix forms can be obtained through other approaches like chaining 
as well jPET5] . Let 7a be a A-net over the convex set C. By the above inequality, we obtain 


sup ||H S (3) - H w (3)|| 2 < nmx ||H S (3') - H [n] (3')|| 2 + {M n + M ]s ,) A. 


(A.5) 


Now we will argue that the right hand side is small with high probability using the matrix 
Hoeffding’s inequality from [Tro H- By the union bound over 7a, we have 


P (max ||Hs(3') - H [n] (3')|| 2 > e) <|Ta| P (||H s ( 3') - H [n] (3')|| 


> e . 


For the hrst term on the right hand side, by Lemma |E. 1 we write: 


/ diam(C)y 

|7il 5 I.WAL ■ 
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As before, let S = {si, S 2 ,S| 5 |}, that is, s* denote the different indices in S. For any 9 e C and 
i = 1,2, we define the centered Hessians W i{6) as 


W i (0) = H Si (0)-H [n] (0). 

By the Assumption ([2]), we have the same bounds as in Eq.( |A.2 ). Hence, for e > 0 and 9 E C, by 
the matrix Hoeffding’s inequality )Trol2j . 

|H S («)-H W (0)|| 2 > £ )< 2pexp{-i|H}. 

We would like to obtain an exponential decay with a rate of at least 0(p). Hence, we require, 


plog + l„g(2p) + p < plog ( 4 d " m f^ ) , 

“ 32A" 2 ’ 

which gives the optimal value of e as 


e > 


' 32 I< 2 p ^ ^4diam(C)^p\ 


\S\ 


Therefore, we conclude that for the above choice of e, with probability at least 1 — e p , we have 

|H S (0)-H [n] (0)|| 2 < 


max 

9&Ta 


32K 2 p f 4diarn(C) y/pX 


\S\ 


log 




A 


■ 


Applying this result to the inequality in Eq.(A.5), we obtain that with probability at least 1 — e p , 


sup||H s (0)-H [n] (0)|| 2 < 

0EC 


32 I< 2 p f 4diam(C)^/p\ 


\S\ 


log 




A 


J + ( M n + Mjs|) A. 


The right hand side of the above inequality depends on the net covering diameter A. We optimize 
over A using Lemma E.5 which provides for 


A = 4. 


K 2 p 


A -9- l°g 

\| {M n + M {sl ) 2 \S\ V 


I LilCXlJLl^C' J yiVJLyi 




K 2 


we obtain that with probability at least 1 — e p , 


sup H 5 (6») 

6>eC 


Hr, 


9 < 8 K, 

\| 


li lo H 


f diam(C) 2 [M n + M\ S |) 2 |S| 


K 2 


Combining this with the bound stated in Eq.(A.l), we conclude the proof. 


□ 
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Proof of Theorem 3.6 . 


!£-£[ 


Ap 

a; 

+ CK\ 

/log ip) 

1 

1 

Y-l 

*;+i 

1 N 

\‘+i 

A r+1 


< 


k\K+i - a*. +1 | + a'|a* - a; 

A* + iA* + i 


+ cA i 


'log(p) |A * +1 - A, 


r+ll 


I St 


A*+iA^+i 


By the Weyl’s and matrix Hoeffding ’s |Tr o!2| inequalities (See Eq. (|A.3|) for details), we can write 


IAj — A*| < H 5t (^)-H [n] (0,) 


< cK \ 


/log(p) 


I St 


with probability 1 — 2/p. Then, 


c'Aa/^E c "k 21 °^ 

I c t t*\ s' V l 6 *l , \St\ 

I? 1 “ Si I n h "T* V ’ 

/ 'm I 1 ' / 'V I "I _I_ I S\ /r , | 1 


< 


V+l'V+l / V+1 / Y+1 
juts /log(p) 

c A v w 


k ( k — cK 


log(p) 

|A| 


for some constants c and c w . 


□ 


Proof of Corollary^ 7} Observe that /*(6>) = <J>((x*,0)) - yi(xi,6), and V 2 e fi{9) = Xixf^ 2 \(xi,9)). 
For an index set S, we have V0, 9' E C 


|h s (0)-h 5 (0')|| 2 = 


|S| 


^Xixf $ (2) «Xi,0)) - 


ieS 


< A max ||Xj ||2 ||0 — 0 7 ||2 < LR?J 2 \\9 — 9 '|| 2 . 


ieS 


Therefore, the Assumption [I] is satisfied with the Lipschitz constant M| 5t | := LPl/J 2 . Moreover, by 
the inequality 

||V0/i(0)|| 2 = ||xj|| 2 <f> (2) ((xj,60) < R x ,= XixJ^ 2 \(xi,9)) 


the Assumption [2] is satisfied for K := R x . We conclude the proof by applying Theorem 3.2 


□ 


B Properties of composite convergence 

In the previous sections, we showed that NewSamp gets a composite convergence rate, i.e., the £2 
distance from the current iterate to the optimal value can be bounded by the sum of a linearly and 
a quadratically converging term. We study such convergence rates assuming the coefficients do not 
change at each iteration t. Denote by A t, the aforementioned £2 distance at iteration step t, i.e., 

A t = ||<? f -M2 , (B.l) 
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and assume that the algorithm gets a composite convergence rate as 

Vt > 0, A t+ i < £iAt + £2 Af, 

where £i ,£2 > 0 denote the coefficients of linearly and quadratically converging terms, respectively. 


B.l Local asymptotic rate 

We state the following theorem on the local convergence properties of compositely converging algo¬ 
rithms. 


Lemma B.l. For a compositely converging algorithm as in Eq. (B.l) with coefficients 1 > £ 1,^2 > 0, 
if the initial distance Aq satisfies Aq < (1 — £i)/£ 2 , then we have 


lim sup — - log (A*) < -log(£i). 

t—> OO * 


The above theorem states that the local convergence of a compositely converging algorithm will 
be dominated by the linear term. 


Proof of Lemma B.l. The condition on the initial point implies that A< —> 0 as t —> 00 . Hence, for 


any given 5 > 0, there exists a positive integer T such that Vi > T, we have A t < 5/£ 2 - For such 
values of t, we write 


£1 + £2 A t < £1 + S, 


and using this inequality we obtain 

At+i < (£1 + 6)At. 

The convergence of above recursion gives 

— ~ log (At) < — log(£i + <5) - ^log(Ao). 

Taking the limit on both sides concludes the proof. □ 


B.2 Number of iterations 


The total number of iterations, combined with the per-iteration cost, determines the total complexity 
of an algorithm. Therefore, it is important to derive an upper bound on the total number of iterations 
of a compositely converging algorithm. 


Lemma B.2. For a compositely converging algorithm as in Eq. (B.l) with coefficients £ i ,£2 £ (0,1), 
assume that the initial distance Ao satisfies Ao < (1 — £ 1)/£2 and for a given tolerance e, define the 
interval 


D 


- 




max 


= ^ lA ° \ A 
-’1-£ 2 A 0 J ’ 


Then the total number of iterations needed to approximate the true minimizer with e tolerance is 
upper bounded by T(5f), where 


<5* = argmin (5gl )T((5) 
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and 


T(S ) = log 2 


( log (6 + 6 C 2 ) \ 

VMw(a+<^))/ 


, lQ g (f) 

iog(6 + 6<5)’ 


Proof of Lemma B. 2 


a real number and 1 1 


We have At —> 0 as t —> 00 by the condition on initial point Ao- Let 5 £ D be 
be the last iteration step such that A t > 5. Then \/t > t\, 


At+i <fiA t + ^Af, 

<(f + fe)A?. 


Therefore, in this regime, the convergence rate of the algorithm is dominated by a quadratically 
converging term with coefficient (fi/S + £2)- The total number of iterations needed to attain a 
tolerance of 5 is upper bounded by 


1 1 < log 2 


( log (6 + Sf 2 ) \ 

\ lo g (^(6 + ^ 2 )) J ' 


When At < 6, namely t > t \, we have 


At+i <^iA f +^ 2 Aj, 

< (?i + C2<5) A t . 


In this regime, the convergence rate is dominated by a linearly converging term with coefficient 
(Ci + C2<5)- Therefore, the total number of iterations since t\ until a tolerance of e is reached can be 
upper bounded by 


t 2 < 


Ml) 

iog({i +65)' 


Hence, the total number of iterations needed for a composite 
tolerance of e is upper bounded by 


algorithm as in Eq. B.l to reach a 


T(S) = h + t 2 


log 2 


( log (Cl + \ 

(log (^(6 + 56))/ 


(I) 

log(Cl +C2^)’ 


The above statement holds for any <5 e D. Therefore, we minimize T(<5) over the set D. 


n 


C Choosing the step size rj t 

In most optimization algorithms, step size plays a crucial role. If the dataset is so large that one 
cannot try out many values of the step size. In this section, we describe an efficient and adaptive 
way for this purpose by using the theoretical results derived in the previous sections. 
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In the proof of Lemma 3.1 we observe that the convergence rate of NewSamp is governed by the 
term 


i-r h q!H [n] (e) < i- m q/H [n] {e 


+ vt 11 Q f 


#[n](^) - H[ n ](9) 


where Q f is dehned as in Algorithm 1. The right hand side of the above equality has a linear 
dependence on rjt- We will see later that this term has no effect in choosing the right step size. On 
the other hand, the first term on the right hand size can be written as, 


I - ritQ f H b 


.](#*) || 2 = max 11 - r ? tA min (Q t iL w (^)),r 7 tA max (Q t IL w (0*)) - l} 


If we optimize the above quantity over r)t, we obtain the optimal step size as 

2 

Vt = 


A m in(Q t H[ n ](0 t )) + A max (Q < i+](6k)) 


(C.l) 


It is worth mentioning that for the Newton’s method where Q f = H\^{9 t ) _1 , the above quantity is 
equal to 1. 

Since NewSamp does not compute the full Hessian H[ n ]((L) (which would take 0(np 2 ) computa¬ 
tion), we will relate the quantity in Eq. (C.l) to the first few eigenvalues of QL Therefore, our goal 


is to relate the eigenvalues of Q, t H^(9 t ) to that of Q f . 
By the Lipschitz continuity of eigenvalues , we write 


l-A max (Q f + n] (+) < ||Q*|| 2 Hsih-H^) 

1 


At 


r+l 



(C.2) 


Similarly, for the minimum eigenvalue, we can write 


\t 

Ap 


At 


r+l 


-A min (Q^ w (0+) 


< 


Ar+l 


O 


/log(p) 

\S\ 


(C.3) 


One might be temped to use 1 and Ap/A* +1 for the minimum and the maximum eigenvalues of 
Q, t Hi n \(9 t ), but the optimal values might be slightly different from these values if the sample size is 
chosen to be small. On the other hand, the eigenvalues A* +1 and A* can be computed with 0(p 2 ) 
cost and we already know the order of the error term. That is, one can calculate A* +1 and A* and 
use the error bounds to correct the estimate. 

The eigenvalues of the sample covariance matrix will concentrate around the true values, spreading 
to be larger for large eigenvalues and smaller for the small eigenvalues. That is, if we will we will 
overestimate if we estimate Ai with A^. Theref ore, i f we use 1 , we will always underestimate the 
value of A max (Q t + n ](6 )t )), which, based on Eq. (C.2) and Eq. (C.3), suggests a correction term of 

O ^yiog(p)/|5| j. Further, the top r + l eigenvalues of [Q+ 1 are close to the eigenvalues of + n ]( 9 + 
but shifted upwards if p /2 > r. When pj 2 < r, we see an opposite behavior. Hence, we add 
or subtract a correction term of order O ^^/log(p)/|5|^ to A*/A* +1 whether p/2 > r or p/2 < r, 
respectively. The corrected estimators could be written as 
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if p/2 > r, 
if p/2 < r. 


We are more interested in the case where p/2 > r. In this case, we suggest the step size for the 
iteration step t as 


Vt 


2 

1 + ^ + 0 

A r +1 



which uses the eigenvalues that are already computed to construct Q 4 . Contrary to the most algo¬ 
rithms, the optimal step size of NewSamp is generally larger than 1. 


D Further experiments and details 

In this section, we present the details of the experiments presented in Figure [2]and provide additional 
simulation results. 

We first start with additional experiments. The goal of this experiment is to further analyze 
the effect of rank in the performance of NewSamp . We experimented using r-spiked model for 
r = 3,10, 20. The case r = 3 was already presented in Figure [2j which is included in Figure [3] to 
ease the comparison. The results are presented in Figures [3] and the details are summarized in Table 
[2] In the case of LR optimization, we observe through Figure [3] that stochastic algorithms enjoy 
fast convergence in the beginning but slows down later as they get close to the true minimizer. The 
algorithms that come closer to NewSamp in terms of performance are BFGS and LBFGS. Especially 
when r = 20, performance of BFGS and that of NewSamp are similar, yet NewSamp still does better. 
In the case of SVM optimization, the algorithm that comes closer to NewSamp is Newton’s method. 

We further demonstrate how the algorithm coefficients and £2 between datasets in Figure |4j 


E Useful lemmas 

Lemma E.l. Let C be convex and bounded set in and T e be an e-net over C. Then, 

/ diam(C) y 
1 e| - V 2 e/VP ) 


Proof of Lemma E.L A similar proof appears in IVdVW96l . The set C can be contained in a p- 
dimensional cube of size diam(C). Consider a grid over this cube with mesh width 2 e/^/p. Then C 
can be covered with at most (diam(C)/(2e/y / p)) p many cubes of edge length 2 e/y/p. If ones takes 
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Logistic Regression, rank=3 Logistic Regression, rank=10 Logistic Regression, rank=20 



6 10 20 30 40 50 6 10 20 30 40 50 6 10 20 30 40 50 

Time(sec) Time(sec) Time(sec) 


SVM, rank=3 



6 25 50 75 10C 

Time(sec) 


SVM, rank=10 SVM, rank=20 



i i i i -3-1. i . i . i . i 

0 25 50 75 10C 0 25 50 75 10C 

Time(sec) Time(sec) 


Figure 3: The plots demonstrate the behavior of several optimization methods on a synthetic data 
set for training SVMs. The elapsed time in seconds versus log of ^-distance to the true minimizer 
is plotted. Red color represents the proposed method NewSamp . 


the projection of the centers of such cubes onto C and considers the circumscribed balls of radius e, 
we may conclude that C can be covered with at most 

/ diam(C)\ p 

V 2 e/^p ) 


many balls of radius e. □ 

Lemma E.2 ( jVerlO] !. Let X be a symmetric pxp matrix, and let T e be an e-net over S p ~*. Then, 

IWk < 1 sup | (Xv, v )|. 

1 - 2e vGT e 

Lemma E.3 ( |GN10] ). Let X be a finite set of Hermitian matrices in M pxp where MXi £ X, we have 

E[*i]=o, M 2 <7, ||E[x|]|| 2 <a 2 . 

Given its size, let S denote a uniformly random sample from {1, 2, \X\} with or without replace¬ 
ment. Then we have 


P 



ies 2 



< 2pexp < — |S'! min 


\4cr 2 ’ 27 ) 
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Logistic Regression 



Rank=3 

Rank=10 

Rank=20 

Method 

Elapsed (sec) 

Iter 

Elapsed(sec) 

Iter 

Elapsed(sec) 

Iter 

NewSamp 

26.412 

12 

32.059 

15 

55.995 

26 

BFGS 

50.699 

22 

54.756 

31 

56.606 

34 

LBFGS 

103.590 

47 

64.617 

37 

107.708 

67 

Newton 

18235.842 

449 

35533.516 

941 

31032.893 

777 

GD 

345.025 

198 

322.671 

198 

311.946 

197 

AGD 

449.724 

233 

436.282 

272 

450.734 

290 


Support Vector Machines 



Rank=3 

Rank= 

10 

Rank=20 

Method 

Elapsed(sec) 

Iter 

Elapsed (sec 

Iter 

Elapsed (sec) 

Iter 

NewSamp 

47.755 

8 

52.767 

9 

124.989 

22 

BFGS 

13352.254 

2439 

10672.657 

2219 

21874.637 

4290 

LBFGS 

326.526 

67 

218.706 

44 

275.991 

55 

Newton 

775.191 

16 

734.480 

16 

4159.486 

106 

GD 

1512.305 

238 

1089.413 

237 

1518.063 

269 

AGD 

1695.44 

239 

1066.484 

238 

1874.75 

294 


Table 2: Details of the simulations presented in Figures [3j 


Lemma E.4. Let Z be a random variable with a density function f and cumulative distribution 
function F. If F c = 1 — F, then, 


®[^l{|z|>t}]| < tF’d-Z'l > t) + 


¥(\Z\ > z)dz. 


Proof. We write, 


pOO p — t 

E [ zl {\z\>t}] = zf(z)dz+ / zf(z)dz. 

J t J — OO 


Using integration by parts, we obtain 


J zf(z)dz 


zF c (z) + 


F c (z)dz, 


=zF(z) 


F(z)dz. 


Since Hindoo zF c (z) = liin^-oo zF(z) = 0, we have 

poo poo 

/ zf(z)dz =tF c (t ) + / F c (z)dz , 

Jt Jt 

f zf(z)dz = — tF(—t) — f F(z)dz, 

J— oo J—oo 

/ OO 

F(-z)ds 
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CT Slices Dataset 



LR 

SVM 

Method 

Elapsed (sec) 

Iter 

Elapsed (sec) 

Iter 

NewSamp 

9.488 

19 

22.228 

33 

BFGS 

9.568 

38 

2094.330 

5668 

LBFGS 

51.919 

217 

165.261 

467 

Newton 

14.162 

5 

58.562 

25 

GD 

350.863 

2317 

1660.190 

4828 

AGD 

176.302 

915 

1221.392 

3635 


MSD Dataset 



LR 

SVM 

Method 

Elapsed(sec) 

Iter 

Elapsed(sec) 

Iter 

NewSamp 

25.770 

38 

71.755 

49 

BFGS 

43.537 

75 

9063.971 

6317 

LBFGS 

81.835 

143 

429.957 

301 

Newton 

144.121 

30 

100.375 

18 

GD 

642.523 

1129 

2875.719 

1847 

AGD 

397.912 

701 

1327.913 

876 


Synthetic Dataset 



LR 

SVM 

Method 

Elapsed(sec) 

Iter 

Elapsed (sec) 

Iter 

NewSamp 

26.412 

12 

47.755 

8 

BFGS 

50.699 

22 

13352.254 

2439 

LBFGS 

103.590 

47 

326.526 

67 

Newton 

18235.842 

449 

775.191 

16 

GD 

345.025 

198 

1512.305 

238 

AGD 

449.724 

233 

1695.44 

239 


Table 3: Details of the experiments presented in Figure [2] 
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Hence, we obtain the following bound, 


/*oo /*oo 

|E[Zl { , Z | >t} ]| = tF c (t)+ / F c (z)dz — tF(—t) — / F(-z)dz 
<f (F C (t) + F(-f)) + Qf°° F c (z) + F(-z)dz 

/ OC 

P(|Z| > z)dz. 


□ 


Lemma E.5. For a,b > 0, and e satisfying 


e = 



and 


2 

a 


b 2 > e, 


we have e 2 > a\og(b/e). 

Proof. Since a, 6 > 0 and x —>• e x is a monotone increasing function, the above inequality condition 
is equivalent to 

2e 2 2 a 2 2b 2 

—e “ > —. 
a a 


Now, we define the function f{w) = we w for w > 0. / is continuous and invertible on [0,oo). Note 
that / _1 is also a continuous and increasing function for w > 0. Therefore, we have 




Observe that the smallest possible value for e would be simply the square root of af 1 (2 b 2 /a) /2. 
For simplicity, we will obtain a more interpretable expression for e. By the definition of f~ l , we have 

log(/ _1 (2/)) + f~ l {y) = log (y). 

Since the condition on a and b enforces f^ 1 (y) to be larger than 1, we obtain the simple inequality 
that 


/ 1 (y) < log (y). 


Using the above inequality, if e satisfies 


e 


2 




we obtain the desired inequality. 


□ 
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Value Value 


Spiked-model dataset CT-Slice dataset 




MSD-Song dataset Covertype dataset 




Figure 4: The plots demonstrate the behavior of £1 and £2 over several datasets. 


31 








